Abstract

The Alewife multiprocessor project focuses on the architecture and design
of a large-scale parallel machine. The machine uses a low-dimension
direct interconnection network to provide scalable communication bandwidth,
while allowing the exploitation of locality. Despite its distributed
memory architecture, Alewife allows efficient shared-memory programming
through a multilayered approach to locality management. A new
scalable cache coherence scheme called LimitLESS directories allows the
use of caches for reducing communication latency and network bandwidth
requirements. Alewife also employs run-time and compile-time methods
for partitioning and placement of data and processes to enhance
communication locality. While the above methods attempt to minimize
communication latency, remote communication with distant processors cannot
be completely avoided. Alewife’s processor, Sparcle, is designed to tolerate
these latencies by rapidly switching between threads of computation.
This paper describes the Alewife architecture and concentrates on the
novel hardware features of the machine, including LimitLESS directories
and the rapid context switching processor.


1  Introduction 

High-performance computer design is driven by the need to solve important
problems efficiently and at a reasonable cost. While single-processor
performance is limited by physical constraints, advances in technology make
machines with thousands of processors feasible. Highly parallel machines offer
significant cost-performance benefits over single-processor machines.

Parallel machines are commonly organized as a set of nodes that communicate
over an interconnection network, each node containing a processor and some
memory. From the perspective of a node in a real machine built in
three-dimensional space, some nodes will be physically closer than others.
Informally, a program running on a parallel machine displays communication
locality (or memory reference locality) if the probability of communication
(or access) to various nodes decreases with physical distance. Communication
locality in parallel programs depends on the application as well as on the
partitioning and placement of data and processes.

Parallel machines are scalable if they can exploit communication locality in
parallel programs. That is, for programs that display communication locality,
scalable machines can offer proportionally better performance with more
processing nodes [29]. Scalable machines are easily programmable if they
provide automatic enhancement of communication locality in parallel programs.

The Alewife experiment explores methods for automatic enhancement of locality
in a scalable parallel machine. The Alewife multiprocessor uses a distributed
shared-memory architecture with a low-dimension direct network. Such
networks are cost-effective and modular, and they encourage the exploitation
of locality [34, 19, 2]. Unfortunately, non-uniform communication latencies
usually make such machines hard to program because the onus of managing
locality invariably falls on the programmer. The goal of the Alewife project
is to discover and to evaluate techniques for automatic locality management
in scalable multiprocessors.

Alewife uses a multilayered approach to achieve this goal, consisting of
techniques for latency minimization and latency tolerance. The compiler,
runtime system, and hardware cooperate to enhance communication locality,
thereby reducing average communication latency and required network
bandwidth. Because remote communication with distant processors cannot always
be avoided, Alewife’s processor tolerates the resulting latencies by rapidly
switching between threads of computation.

This paper focuses on the organization of the Alewife machine and describes its
hardware features for automatic locality management. These features include
shared-data caching, made possible by a new cache coherence scheme called
LimitLESS directories, and rapid context switching. We present an overview
of our approach to locality management in Section 2, and describe the machine
organization and the programming environment in Section 3. Section 4
discusses the LimitLESS directory scheme, and Section 5 outlines our approach
to latency tolerance. We also discuss the performance of the machine on a few
applications. Other details of the machine are presented elsewhere [4, 8, 28].
Section 6 discusses related work, and Section 7 offers some perspective and
summarizes the paper.


2  System  Overview 


The Alewife compiler, runtime system, and hardware try to reduce the
communication latency where possible, and attempt to tolerate the latency
otherwise. We are developing compiler technology to enhance the static
communication locality of applications. Programs are first transformed into an
intermediate task graph representation called WAIF [27], where the
communication between threads is exposed through program analysis. Succeeding
stages of the compiler map the task graph onto the machine and attempt to
minimize overall execution time. When the compiler lacks enough information to
make good placement decisions, it relegates the responsibility to the runtime
layer.

Run-time  software  participates  in  enhancing  locality  through  lazy  task  creation, 
a  novel  dynamic  partitioning  method  [28],  and  intelligent  scheduling.  In  a 
dynamic  partitioning  system  the  programmer  or  compiler  can  expose  all  of  the 
parallelism  in  an  application,  but  new  tasks  will  be  created  at  runtime  only  when 
there  are  idle  processors.  To  enhance  the  likelihood  of  placing  related  tasks  close 
to  each  other,  a  locality-based  tree  scheduler  determines  the  order  in  which  idle 
processors  search  for  new  tasks.  To  reduce  the  network  bandwidth  consumed  by 
the  searching  processors,  only  single  representatives  from  neighborhoods  search 
for  work.  Simulations  of  several  parallel  applications  with  64  processors  showed 
that  a  mesh  network  yielded  roughly  the  same  speedup  as  a  more  expensive 
multistage  network,  when  both  used  lazy  task  partitioning,  a  tree  scheduler, 
and  coherent  caches. 


Caching shared data is Alewife’s hardware method for reducing memory access
latency. With caches, the software does not need to worry as much about careful
initial data placement; the caches dynamically move data objects close to the
processor, so accesses are satisfied completely within a node. A new scalable
scheme called LimitLESS directories solves the cache coherence problem. The
LimitLESS directory is a small set of pointers (say 4) distributed along with
each block of main memory that tracks copies of cached data and maintains
memory consistency by transmitting invalidation messages over the network.
The LimitLESS scheme allows a memory module to interrupt its local processor
for software emulation of a full-map directory when the small set of pointers
overflows. Section 4 describes and evaluates this scheme.

If the system cannot avoid a remote memory request, Alewife’s processor can
rapidly schedule another task in place of the stalled process. Alewife also
tolerates synchronization latencies and provides fast traps through the same
context switching mechanism. Because context switches are forced only on memory
requests that require the use of the interconnection network and on
synchronization faults, the processor achieves high single-thread performance.

We believe such a layered approach is necessary to build truly general-purpose
parallel machines. Real applications are composed of phases, which will benefit
in different proportions from the various layers. For example, matrix
computations can benefit from static compiler analysis, while combinatorial
search problems will profit from the runtime and cache layers. Finally,
efficient execution of phases without inherent locality, such as matrix
transpose, is possible when the processors can mask the latency of remote
requests.


3  Hardware  Organization  of  Alewife 

Figure  1  depicts  the  Alewife  machine  as  a  set  of  processing  nodes  connected  in 
a  mesh  topology.  Each  Alewife  node  consists  of  a  processor,  a  cache,  a  portion 
of  globally-shared  distributed  memory,  a  cache-memory-network  controller,  a 
floating-point  coprocessor,  and  a  network  switch. 

A  single-chip  controller  on  each  node  holds  the  cache  tags  and  implements 
the  cache  coherence  protocol  by  synthesizing  messages  to  other  nodes.  While 
the  Alewife  architecture  is  scalable,  the  number  of  directory  pointer  bits  in 
the  current  implementation  of  our  controller  will  limit  the  maximum  size  of  the 
machine  to  512  nodes.  The  controller  uses  a  simple  message-based  interface  with 
the  network.  Various  forms  of  shared  memory  coherence  models  are  maintained 
by  the  controller  via  messages  to  other  nodes.  Alewife  has  a  simple  memory 
mapping  scheme.  The  top  few  bits  of  the  address  determine  the  node  number, 
and  the  rest  of  the  address  is  the  index  within  the  specific  module. 
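
As a minimal sketch of this mapping, the following Python fragment splits an
address into a node number and an index within that node's memory module. The
32-bit address width and the 9-bit node field (enough for the 512-node limit
mentioned above) are assumptions chosen for illustration, as are the function
names; the text does not give the actual field widths.

```python
ADDRESS_BITS = 32                        # assumed address width (not stated in the text)
NODE_BITS = 9                            # assumed: 2**9 = 512 nodes, the stated maximum
INDEX_BITS = ADDRESS_BITS - NODE_BITS    # remaining low-order bits index within a node


def split_address(addr):
    """Return (node, index): the top bits select the node, the rest the offset."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    node = addr >> INDEX_BITS
    index = addr & ((1 << INDEX_BITS) - 1)
    return node, index


def make_address(node, index):
    """Inverse of split_address, for building a global address."""
    return (node << INDEX_BITS) | index
```

Because the node number is a fixed bit field, a controller can decide whether a
reference is local or remote with a single comparison against its own node
number.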

As shown in Figure 1, each node contains a network switch chip, specifically the
Frontier series Mesh Routing Chip (FMRC) from Caltech. The mesh network
uses wormhole routing [11], a variant of cut-through routing [21]. The network
has eight-bit channels, with a throughput of roughly 100M bytes per second in
each direction. Free ports on peripheral nodes of the network are used for I/O,
monitor, and host connections. The prototype Alewife system will attach to a
host SUN backplane by interfacing a network switch to the VME bus.

The processor uses a memory-reference-based interface with the controller,
although the controller uses a message-based interface for internode
communications. Using a control word associated with each memory reference,
the processor synthesizes various types of synchronization and communication
operations. This interface allows a simple implementation of the processor.

Sparcle, a first-round prototype based on modifications to LSI Logic’s SPARC
processor [36] implementation, will clock at 33 MHz and context switch in 11
cycles. Each node has 64K bytes of direct-mapped cache and 4M bytes of
globally-shared main memory. Each node also has an additional 4M bytes of
local memory, a portion of which is used for the coherence directory. Alewife’s
cache and floating-point units are SPARC compatible. Sparcle uses a block
multithreaded architecture [4].

Initially, our software system will be based on Mul-T [23]. A parallel C-like
language is also under development. Mul-T’s basic mechanism for generating
concurrent threads is the future construct. The expression (future X), where
X is an arbitrary expression, creates a task to evaluate X and also creates an
object known as a placeholder to eventually hold the value of X. When created,
the future is in an unresolved or undetermined state. When the value of X
becomes known, the future resolves to that value, effectively mutating into the
value of X and losing its identity as a future. Concurrency arises because the
expression (future X) returns the future as its value without waiting for the
future to resolve. Thus, the computation containing (future X) can proceed
concurrently with the evaluation of X. The act of suspending computation if
an object is an unresolved future and then proceeding when the future resolves
is known as touching the object.
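
The behavior of future and touch can be approximated in Python with the
standard concurrent.futures module. This is only an analogy, not Mul-T: a
Python Future does not mutate into its value, so the sketch uses an explicit
touch helper, and the names slow_square, touch, and demo are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def slow_square(n):
    """Stand-in for an arbitrary expression X."""
    return n * n


def touch(obj):
    """Wait for obj if it is an unresolved future; otherwise return it as is."""
    return obj.result() if isinstance(obj, Future) else obj


def demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut = pool.submit(slow_square, 7)   # roughly (future (slow-square 7))
        other = 3 + 4                       # the parent continues without waiting
        return touch(fut) + touch(other)    # touching suspends until resolved
```

In Mul-T the touch is implicit in strict operators, which is why hardware
support for detecting unresolved futures matters; here it has to be written
out by hand.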

Our processor will allow operators to check for resolved futures with no
overhead, disposing of the 60-100% overhead incurred by the system on other
processors. Support for lightweight full-empty bit synchronization [35] in the
processor will allow use of efficient fine-grain parallelism. In addition, the
modified SPARC implementation is competitive in raw performance with
contemporary sequential machines.

We  propose  to  use  Mul-T  as  our  intermediate  compiler  language,  augmented 
with  primitives  for  specifying  explicit  partitioning  and  placement  of  both  data 
and  processes.  Our  compiler  will  partition  a  program  taking  communication 
costs  into  account,  and  produce  an  extended  Mul-T  program  consisting  of  a 
set  of  tasks  with  granularity  and  placement  information.  The  Orbit  optimizing 
compiler  [13,  22]  will  then  compile  these  tasks  to  Sparcle  machine  code. 

The design of the Alewife machine is in progress and a detailed simulator called
ASIM is operational. ASIM implements several cache coherence protocols and
interconnection network architectures. When ASIM is configured with its full
statistics-gathering capability, it runs at about 5000 processor cycles per second
on an unloaded SPARCserver 330. At this rate, a 64-processor machine runs
approximately 80 cycles per second. Most of the simulations that we chose for
this paper run for roughly one million cycles (a fraction of a second on a real
machine), which takes 3.5 hours to complete. This lack of simulation speed is
one of the primary reasons for implementing the Alewife machine in hardware:
to enable a thorough evaluation of our ideas on much larger applications.
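
The figures above can be checked with a short calculation. The sketch assumes
that the simulator's 5000 processor cycles per second are divided evenly among
the simulated processors; the function name is hypothetical.

```python
def simulation_hours(machine_cycles, processors, simulator_rate=5000):
    """Estimate wall-clock hours to simulate `machine_cycles` machine cycles."""
    machine_rate = simulator_rate / processors   # machine cycles per second
    return machine_cycles / machine_rate / 3600


rate_64 = 5000 / 64                        # about 78 cycles per second
hours_64 = simulation_hours(1_000_000, 64)  # roughly three and a half hours
```

Under this assumption the simulated 64-processor machine advances about 78
cycles per second, consistent with the quoted "approximately 80", and one
million cycles take a little over 3.5 hours.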


4  LimitLESS  Directories 

Shared data caching is an important component of Alewife’s multilayered
system for automatic locality management. Caches reduce the volume of traffic
imposed on the network by providing demand-driven data replication where
needed. However, replicating blocks of data in multiple caches introduces the
cache coherence problem [15, 38]. A number of cache coherence protocols have
been proposed to solve the coherence problem in network-based
multiprocessors [6, 37, 5, 20]. These message-based protocols allocate a
section of the system’s memory, called a directory, to store the locations and
state of the cached copies of each data block. The protocols send messages with
data requests or invalidation signals, and record the acknowledgment of each of
these messages to ensure global consistency of memory.

Although directory protocols have been around since the late 1970’s, the
usefulness of the early protocols (e.g., [6]) was in doubt for several reasons.
First, the directory itself was a centralized monolithic resource which
serialized all requests. Second, directory accesses were expected to consume a
disproportionately large fraction of the available network bandwidth. Third,
the directory became prohibitively large as the number of processors increased.
To store pointers to blocks potentially cached by all the processors in the
system, the size of the directory memory in early full-map protocols grows as
$\Theta(N^2)$, where N is the number of processors in the system.

As observed in [5], the first two concerns are easily dispelled: The directory can
be distributed along with main memory among the processing nodes to match
the aggregate bandwidth of distributed main memory. Furthermore, required
directory bandwidth is not much more than the memory bandwidth, because
accesses destined to the directory alone comprise a small fraction of all network
requests. Thus, the challenge lies in alleviating the severe memory requirements
of the distributed full-map directory schemes.

Scalable coherence protocols differ in the size and the structure of the
directory memory. Limited directory protocols [5], for example, avoid the severe
memory overhead of full-map directories by allowing only a limited number of
simultaneously cached copies of any individual block of data. Unlike a full-map
directory, the size of a limited directory grows as $\Theta(N \log N)$ with the
number of processors. Once all of the pointers in a directory entry are filled,
the protocol must evict previously cached copies to satisfy new requests to
read the data associated with the entry. In such systems, widely shared data
locations degrade system performance by causing constant eviction and
reassignment, or thrashing, of directory pointers. However, previous studies
have shown that a small set of pointers is sufficient to capture the worker-set
of processors that concurrently read many types of data [7, 39, 30]. The
worker-set of a memory block is defined as the set of processors that
concurrently read a memory location; its size corresponds to the number of
active pointers the block would have in a full-map directory entry.
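
The two growth rates can be made concrete with a small calculation. The sketch
below assumes that total memory grows linearly with the number of nodes (a
fixed number of blocks per node), that a full-map entry holds one presence bit
per node, and that a limited entry holds a fixed number of pointers of
ceil(log2 N) bits each. State bits are ignored and the function names are
hypothetical.

```python
def full_map_bits(n_nodes, blocks_per_node):
    """Directory bits for a full map: one presence bit per node, per block."""
    return n_nodes * blocks_per_node * n_nodes


def limited_bits(n_nodes, blocks_per_node, pointers=4):
    """Directory bits for a limited directory: `pointers` node IDs per block."""
    pointer_width = max(1, (n_nodes - 1).bit_length())   # ceil(log2 N) bits
    return n_nodes * blocks_per_node * pointers * pointer_width


ratio_64 = full_map_bits(64, 1) / limited_bits(64, 1)
ratio_512 = full_map_bits(512, 1) / limited_bits(512, 1)
```

With four pointers per entry, the full map costs under three times as much as
the limited directory at 64 nodes but about fourteen times as much at 512
nodes, and the gap keeps widening as N grows.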

4.1  Overview  of  the  LimitLESS  Protocol 

Alewife implements the LimitLESS cache coherence protocol, which nearly
realizes the performance of the full-map directory protocol, with the memory
overhead of a limited directory, but without excessive sensitivity to widely
shared data. The LimitLESS scheme implements a small set of pointers in the
memory modules, as do limited directory protocols. But when necessary, the
scheme allows a memory module to interrupt the processor for software emulation
of a full-map directory. Its name reflects the above properties: Limited
directory Locally Extended through Software Support.

Figure 1 depicts a set of directory pointers that correspond to the shared data
block X, copies of which exist in several caches. In the figure, the software has
extended the directory pointer array (which is shaded) into local memory.

The  structure  of  the  Alewife  machine  provides  for  an  efficient  implementation  of 
this  memory  system  extension.  Since  each  processing  node  in  Alewife  contains 
both  a  memory  controller  and  a  processor,  it  is  a  straightforward  modification 
of  the  architecture  to  couple  the  responsibilities  of  these  two  functional  units, 
using  the  Sparcle  processor's  fast  trap  mechanism. 

The LimitLESS scheme should not be confused with schemes previously termed
software-based, which require static identification of non-cacheable locations.
Although the LimitLESS scheme is partially implemented in software, it
dynamically detects when coherence actions are required; consequently, the
software emulation should be considered a logical extension of the hardware
functionality. To clarify the difference between protocols, schemes may be
classified by function as static (compiler-dependent) or dynamic (using
run-time information), and by implementation as software-based or
hardware-based.

Component  Name               Meaning
Memory     Read-Only          Some caches have read-only copies of the data.
Memory     Read-Write         Exactly one cache has a read-write copy.
Memory     Read-Transaction   Holding read request, update is in progress.
Memory     Write-Transaction  Holding write request, invalidation is in progress.
Cache      Invalid            Cache block may not be read or written.
Cache      Read-Only          Cache block may be read, but not written.
Cache      Read-Write         Cache block may be read or written.

Table 1: Directory states.

4.2  Protocol  Specification 

We now describe the LimitLESS directory protocol and the architectural
interfaces needed to implement it.

The LimitLESS protocol has the same state transition diagram as the full-map
protocol. The memory controller side of this protocol is illustrated in Figure 2,
which contains the memory states listed in Table 1. These states are mirrored
by the state of the block in the caches, also listed in Table 1. The state
transition diagram specifies the states, the composition of the pointer set (P),
and the transitions between the states. It is the responsibility of the protocol
to keep the states of the memory and the cache blocks coherent. The protocol
enforces coherence by transmitting messages between the cache/memory
controllers. Every message contains the address of a memory block, to indicate
which directory entry should be used when processing the message.

For example, Transition 2 from the Read-Only state to the Read-Write state is
taken when cache i requests write permission (Write Request) and the pointer
set is empty or contains just cache i (P = {} or P = {i}). In this case, the
pointer set is modified to contain i (if necessary) and the memory controller
issues a message containing the data of the block to be written (Write Data).
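
Transition 2 can be written as a guarded state update. The sketch below models
only this one transition, using a Python set for P; the function name, the
tuple encoding of messages, and the None return for cases the example does not
cover are all assumptions made for illustration.

```python
def handle_write_request(state, pointers, requester):
    """Apply Transition 2 if its guard holds.

    Returns (new_state, new_pointers, messages), or None when the guard
    fails and some other transition (not modeled here) would apply.
    """
    # Guard: directory is Read-Only and P is {} or {requester}.
    if state == "Read-Only" and pointers <= {requester}:
        # P is modified to contain the requester; Write Data is sent to it.
        return "Read-Write", {requester}, [("Write Data", requester)]
    return None
```

The subset test `pointers <= {requester}` is true exactly for the two pointer
sets named in the text, the empty set and the set containing only cache i.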

Following the notation in [5], both full-map and LimitLESS are members of the
$\mathrm{Dir}_N\mathrm{NB}$ class of cache coherence protocols. Therefore, from
the point of view of the protocol specification, the LimitLESS scheme does not
differ substantially from the full-map protocol. In fact, the LimitLESS protocol
is also specified in Figure 2. The extra notation on the Read-Only ellipse
(S: n > p) indicates that the state is handled in software when the size of the
pointer set (n) is greater than the size of the limited directory (p). (See [8]
for details.) In this situation, the transitions with the square labels (1, 2,
and 3) are executed by the interrupt handler on the processor that is local to
the overflowing directory. When the protocol changes from a software-handled
state to a hardware-handled state, the processor must modify the directory
state so that the memory controller can resume responsibility for the protocol
transitions.

Figure 2: Directory state transition diagram.


4.3  Interfaces  for  LimitLESS 

This section outlines the architectural features and hardware interfaces needed
to support the LimitLESS directory scheme. To support the LimitLESS protocol
efficiently, a cache-based multiprocessor needs several properties. First, it
must be capable of rapid trap handling. Sparcle permits execution of trap code
within five to ten cycles from the time a trap is initiated.

Second, the processor needs complete access to coherence-related controller
state, such as pointers and state bits in the hardware directories. Similarly,
the directory controller must be able to invoke processor trap handlers when
necessary. The hardware interface between the Alewife processor and controller,
depicted in Figure 3, is designed to meet these requirements. The address
and data buses permit processor manipulation of controller state and initiation
of actions via load and store instructions to memory-mapped I/O space. In
Alewife, the directories are placed in this special region of memory, which is
distinguished from normal memory space by a distinct Alternate Space Indicator
(ASI). The controller returns two condition bits and several trap lines to the
processor.

Figure 3: Signals between processor and controller.

Finally, a machine implementing the LimitLESS scheme needs an interface to
the network that allows the processor to launch and to intercept coherence
protocol packets. Although most shared-memory multiprocessors export little
or no network functionality to the processor, Alewife provides the processor with
direct network access through the Interprocessor-Interrupt (IPI) mechanism.

The Alewife machine supports a complete interface to the interconnection
network. This interface provides the processor with a superset of the network
functionality needed by the cache-coherence hardware. Not only can it be used
to send and receive cache protocol packets, but it can also be used to send
preemptive messages to remote processors (as in message-passing machines),
hence the name Interprocessor-Interrupt.

We stress that the IPI interface is a single generic mechanism for network
access, not a conglomeration of different mechanisms. The power of such a
mechanism lies in its generality.

The current implementation of the LimitLESS trap handler is as follows. When
an overflow trap occurs for the first time on a given memory line, the trap
code allocates a full-map bit-vector in local memory. This vector is entered
into a hash table. All hardware pointers are emptied and the corresponding
bits are set in this vector. The directory state for that block is tagged
Trap-On-Write. Emptying the hardware pointers allows the controller to continue
handling read requests until the next pointer array overflow and maximizes the
number of transactions serviced in hardware. However, the memory controller
must interrupt the processor upon a write request. When additional overflow
traps occur, the trap code locates the full-map vector in the hash table, empties
the hardware pointers, and sets the appropriate bits in the vector.

Software handling of a memory line terminates when the processor traps on an
incoming write request or local write fault. The trap handler finds the full-map
bit vector and empties the hardware pointers as above. Next, it records the
identity of the requester in the directory, sets the acknowledgment counter to
the number of bits in the vector that are set, and places the directory in its
normal Write Transaction state. Finally, it sends invalidations to all caches with
bits set in the vector. The vector may now be freed. At this point, the memory
line has returned to hardware control. When all invalidations are acknowledged,
the hardware will send the data with write permission to the requester.
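
The overflow and write-termination steps described above can be sketched as
follows. This is a simplified model rather than the Alewife trap code: a Python
set stands in for the full-map bit vector, a dict stands in for the hash table,
and the class and function names are hypothetical. It also assumes that the
cache whose read request caused the overflow is recorded in the freshly emptied
hardware pointers, which the text does not specify.

```python
class DirectoryEntry:
    """Hardware directory entry with a small, fixed number of pointers."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pointers = set()
        self.state = "Read-Only"
        self.trap_on_write = False
        self.requester = None
        self.ack_count = 0


overflow_table = {}   # hash table: block address -> full-map vector (a set here)


def overflow_trap(addr, entry):
    """Software handler: spill the hardware pointers into the full-map vector."""
    vector = overflow_table.setdefault(addr, set())   # allocate on first overflow
    vector |= entry.pointers                          # set the corresponding bits
    entry.pointers.clear()                            # hardware can keep serving reads
    entry.trap_on_write = True


def read_request(addr, entry, cache_id):
    """Hardware path for a read; traps to software only on pointer overflow."""
    if cache_id not in entry.pointers and len(entry.pointers) == entry.capacity:
        overflow_trap(addr, entry)
    entry.pointers.add(cache_id)                      # assumption: see lead-in


def write_trap(addr, entry, requester):
    """Software handler for a write to a Trap-On-Write line.

    Returns the caches that must receive invalidations.
    """
    vector = overflow_table.pop(addr)                 # find, then free, the vector
    vector |= entry.pointers
    entry.pointers.clear()
    entry.requester = requester
    entry.ack_count = len(vector)                     # number of bits set
    entry.state = "Write-Transaction"
    entry.trap_on_write = False                       # back under hardware control
    return sorted(vector)
```

For example, ten distinct readers of one block with four hardware pointers
cause two overflow traps, and a subsequent write produces ten invalidations
with the acknowledgment counter set to ten.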

4.4  Performance  Measurements 

This section presents some preliminary results from the Alewife system
simulator, comparing the performance of limited, LimitLESS, and full-map
directories. The protocols are evaluated in terms of the total number of cycles
needed to execute an application on a 64-processor Alewife machine. Using
execution cycles as a metric emphasizes the bottom line of multiprocessor
design: how fast a system can run a program.

The results presented below are derived from complete Alewife machine
simulations and from dynamic post-mortem scheduler simulations. The
complete-machine simulator runs programs written in the Mul-T language,
optimized by the Mul-T compiler, and linked with a runtime system that
implements both static work distribution and dynamic task partitioning and
scheduling. Post-mortem scheduling, on the other hand, generates a parallel
trace from a uniprocessor execution trace that has embedded synchronization
information [9]. The post-mortem scheduler was implemented by Mathews Cherian
with Kimming So at IBM. The post-mortem scheduler has been modified to
incorporate feedback from the network in issuing trace requests [25].

To evaluate the benefits of the LimitLESS coherence scheme, we implemented
an approximation of the new protocol in ASIM. During the simulations, ASIM
simulates an ordinary full-map protocol, but when the simulator encounters a
pointer array overflow, it stalls both the memory controller and the processor
that would handle the LimitLESS interrupt for $T_s$ cycles. The current
implementation of the LimitLESS software trap handlers in Alewife suggests
$T_s \approx 50$.

Table 2 shows the simulated performance of four applications, using a
four-pointer limited protocol ($\mathrm{Dir}_4\mathrm{NB}$), a full-map
protocol, and a LimitLESS ($\mathrm{LimitLESS}_4$) scheme with $T_s = 50$. All
of the runs simulate a 64-node Alewife machine with 64K byte caches and a
two-dimensional mesh network.

Multigrid is a statically scheduled relaxation program. Weather forecasts the
state of the atmosphere given an initial state, SIMPLE simulates the
hydrodynamic and thermal behavior of fluids, and Matexpr performs several
multiplications and additions of various sized matrices. The computations in
Matexpr are partitioned and scheduled by a compiler. Weather and SIMPLE are
measured using dynamic post-mortem scheduling of traces, while Multigrid and
Matexpr are run on complete-machine simulations.

Application  Dir_4NB    LimitLESS_4  Full-Map
Multigrid    [missing]  [missing]    [missing]
SIMPLE       3.579      [missing]    2.553
Matexpr      1.296      [missing]    [missing]
Weather      1.356      [missing]    0.621

Table 2: Performance for three coherence schemes, in terms of millions of cycles.

Since the LimitLESS scheme implements a full-fledged limited directory in
hardware, applications that perform well using a limited scheme also perform
well using LimitLESS. Multigrid is such an application. All of the protocols
require approximately the same time to complete the computation phase. This
confirms the assumption that for applications with small worker-sets, such as
Multigrid, the limited (and therefore the LimitLESS) directory protocols perform
almost as well as the full-map protocol. See [7] for more evidence of the
general success of limited directory protocols.

To  measure  the  performance  of  LimitLESS  under  extreme  conditions,  we  sim¬ 
ulated  a  version  of  SIMPLE  with  barrier  synchronization  implemented  using  a 
single  lock  (rather  than  a  software  combining  tree).  Although  the  worker-sets 
in  SIMPLE  are  small  for  the  most  part,  the  globally  shared  barrier  structure 
causes  the  performance  of  the  limited  directory  protocol  to  suffer.  In  contrast, 
the  LimitLESS  scheme  is  less  sensitive  to  wide-spread  sharing. 

The Matexpr application uses several variables that have worker-sets of up to 16
processors. Due to these large worker-sets, the run time with the LimitLESS scheme is
almost double that with the full-map protocol. The limited protocol, however,
exhibits  a  much  higher  sensitivity  to  the  large  worker-sets. 

Although  software  combining  trees  distribute  barrier  synchronization  variables 
in  Weather,  one  variable  is  initialized  by  one  processor  and  then  read  by  all  of 
the other processors. Consequently, the limited directory scheme suffers from
hot-spot  access  to  this  location.  As  is  evident  from  Table  2,  the  LimitLESS 
protocol  avoids  the  sensitivity  displayed  by  limited  directories. 


5  Using  Multithreading  to  Tolerate  Latency 

While  dynamic  data  relocation  through  caches  reduces  the  average  memory 
access  latency,  a  fraction  of  memory  transactions  require  service  from  remote 
memory  modules.  When  transactions  cause  the  cache  coherence  protocol  to 
issue  invalidation  messages,  the  remote  memory  access  latency  is  especially 
high.  If  the  resulting  remote  memory  access  latency  is  much  longer  than  the 
time  between  memory  accesses,  processors  spend  most  of  their  time  waiting 
for  memory  transactions  to  be  serviced.  Processor  idle  time  also  results  from 
synchronization  delays. 

One solution allows the processor to have multiple outstanding remote memory
accesses or synchronization requests. Alewife implements this solution by
using a processor that can rapidly switch between multiple threads of computation,
and a cache controller that supports multiple outstanding requests. The
controller forces a context switch when a thread issues a remote transaction or
suffers an unsuccessful synchronization attempt. Processors that rapidly switch
between multiple threads of computation are called multithreaded architectures.

The  prototypical  multithreaded  architecture  is  the  HEP  [35].  In  the  HEP,  the 
processor  switches  every  cycle  between  eight  processor-resident  threads.  Cycle- 
by-cycle  interleaving  of  threads  is  also  used  in  other  designs  [31,  18].  Such 
architectures  are  termed  finely  multithreaded.  Although  fine  multithreading 
offers  the  potential  of  high  processor  utilization,  it  results  in  relatively  poor 
scalar  performance  observed  by  any  single  thread,  when  there  is  not  enough 
parallelism  to  fill  all  of  the  hardware  contexts. 

In  contrast,  Alewife  employs  block  multithreading  or  coarse  multithreading  - 
context  switches  occur  only  when  a  thread  executes  a  memory  request  that 
must  be  serviced  by  a  remote  node  in  the  multiprocessor.  Context  switches  are 
also  forced  when  a  thread  encounters  a  delay  due  to  a  synchronization  variable 
access.  Thus,  as  long  as  a  thread’s  memory  requests  hit  in  the  cache  or  can 
be  serviced  by  a  local  memory  module,  the  thread  continues  to  execute.  Block 
multithreading  allows  a  single  thread  to  benefit  from  the  maximum  performance 
of  the  processor. 
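
The trade-off behind block multithreading can be illustrated with a common first-order utilization model. The model is an assumption introduced here for illustration and is not stated in this paper: a thread runs for a fixed number of cycles between remote requests, each request has a fixed latency, and every switch costs a fixed overhead. The function and parameter names are hypothetical; the 11-cycle switch cost used in the example is the figure quoted for Sparcle, and the run length and latency are made-up values.

```python
def utilization(contexts, run_length, latency, switch_cost):
    """Fraction of processor cycles spent on useful work (first-order model)."""
    if contexts < 1:
        raise ValueError("need at least one context")
    if contexts == 1:
        # A single thread has nothing to switch to: it waits out the latency.
        return run_length / (run_length + latency)
    # Enough contexts to hide the latency completely: only the switch
    # overhead is lost.
    saturated = run_length / (run_length + switch_cost)
    # Too few contexts: the processor still idles for part of the latency.
    linear = contexts * run_length / (run_length + switch_cost + latency)
    return min(linear, saturated)


# Hypothetical workload: 50 cycles of work between remote requests,
# 50-cycle remote latency, 11-cycle context switch.
for n in (1, 2, 4):
    print(n, round(utilization(n, 50, 50, 11), 3))
```

In this model the benefit of extra contexts stops once the latency is fully hidden, after which the switch overhead alone bounds utilization.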

A  multithreaded  architecture  is  not  free  in  terms  of  either  its  hardware  or 
software  requirements.  The  implementation  of  such  an  architecture  requires 
multiple  register  sets  or  some  other  mechanism  to  allow  fast  context  switches, 
additional  network  bandwidth,  support  logic  in  the  cache  controller,  and  ex¬ 
tra  complexity  in  the  thread  scheduling  mechanism.  Other  methods,  such  as 
weak  ordering  [12,  1,  26],  incur  similar  implementation  complexities  in  the  cache 
controller to allow multiple outstanding requests. In Alewife, because the same
context-switching mechanism is used for fast traps and for masking synchronization
latencies as well, we feel the extra complexity is justified.


5.1  Implementing  a  Multithreaded  Processor 

This  section  describes  the  implementation  of  the  Sparcle  processor  and  evaluates 
its potential in masking communication latency. Alewife’s processor is designed
to  meet  several  objectives:  it  must  context  switch  rapidly;  it  must  support  fast 
trap  dispatching;  and  it  must  provide  fine-grain  synchronization. 

Alewife’s block multithreaded processor uses multiple register sets to implement
fast context switching. The same rapid switching mechanism coupled with
widely-spaced trap vectors minimizes the delay between the trap signal and the
execution of the trap code. The wide spacing between trap dispatch points
allows  inlining  of  common  trap  routines  at  the  dispatch  point.  The  processor 
supports  word-level  full-empty  bit  synchronization.  On  a  synchronization  fault, 
the  trap  handling  routine  can  respond  by: 

1.  spinning  -  immediately  return  from  the  trap  and  retry  the  trapping  in¬ 
struction. 

2.  switch  spinning  -  context  switch  without  unloading  the  trapped  thread. 

3.  blocking  -  unload  the  thread. 
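
The three responses listed above might be selected by a trap handler along the following lines. The text does not say how the handler chooses among them, so the policy, the thresholds, and all names in this sketch are hypothetical.

```python
from enum import Enum


class Response(Enum):
    SPIN = "spinning"                # return from the trap and retry
    SWITCH_SPIN = "switch spinning"  # switch context, thread stays loaded
    BLOCK = "blocking"               # unload the thread


def choose_response(failed_attempts, other_contexts_ready,
                    spin_limit=2, switch_limit=8):
    """Pick a response to a synchronization fault (hypothetical policy).

    Escalates from the cheapest response to the most expensive one as the
    number of failed attempts on the same word grows.
    """
    if failed_attempts >= switch_limit:
        return Response.BLOCK
    if other_contexts_ready and failed_attempts >= spin_limit:
        return Response.SWITCH_SPIN
    return Response.SPIN


print(choose_response(0, True).value)
print(choose_response(3, True).value)
print(choose_response(8, False).value)
```

The escalation order reflects the relative costs implied by the list: retrying is cheapest, switching keeps the thread resident, and unloading is reserved for long waits.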

Sparcle  is  based  on  the  following  modifications  to  the  SPARC  architecture. 

•  Register  windows  in  the  SPARC  processor  permit  a  simple  implemen¬ 
tation  of  block  multithreading.  A  window  is  allocated  to  each  thread. 
The  current  register  window  is  altered  via  SPARC  instructions  (SAVE  and 
RESTORE).  To  effect  a  context  switch,  the  trap  routine  saves  the  Program 
Counter  (PC)  and  Processor  Status  Register  (PSR),  flushes  the  pipeline, 
and  sets  the  Frame  Pointer  (FP)  to  a  new  register  window.  [4]  shows 
that  even  with  a  low-cost  implementation,  a  context  switch  can  be  done 
in  about  11  cycles.  By  maintaining  a  separate  PC  and  PSR  for  each 
context,  a  custom  processor  could  switch  contexts  even  faster.  We  show 
that  even  with  11  cycles  of  overhead  and  four  processor  resident  contexts, 
multithreading  significantly  improves  the  system  performance.  See  [40] 
for  additional  evidence  of  the  success  of  multithreaded  processors. 

•  The  effect  of  multiple  hardware  contexts  in  the  SPARC  floating-point  unit 
is  achieved  by  modifying  floating-point  instructions  in  a  context  depen¬ 
dent  fashion  as  they  are  loaded  into  the  FPU  and  by  maintaining  four 
different  sets  of  condition  bits.  A  modification  of  the  SPARC  proces¬ 
sor  will  make  the  context  window  pointer  available  externally  to  allow 
insertion  into  the  FPU  instruction. 

•  Sparcle detects unresolved futures through word-alignment and non-fixnum
traps.

•  The  SPARC  definition  includes  the  Alternate  Space  Indicator  (ASI)  fea¬ 
ture  that  permits  a  simple  implementation  of  the  general  interface  with 
the  controller.  The  ASI  is  available  externally  as  an  eight-bit  field  and 
is set by special SPARC load and store instructions (LDA and STA). By
examining  the  processor’s  ASI  bits  during  memory  accesses,  the  controller 
can  select  between  different  load/store  and  synchronization  behavior. 

•  Through use of the Memory Exception (MEXC) line on SPARC, the controller can
invoke synchronous traps and rapid context switching. Sparcle adds multiple
synchronous trap lines for rapid trap dispatch to common routines.
The  controller  can  suspend  processor  execution  using  the  MHOLD  line. 
Inter-processor  interrupts  are  implemented  via  asynchronous  traps. 
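
The context switch described in the first modification above can be modeled as follows. This is a simplified sketch under stated assumptions rather than the Sparcle implementation: the class and field names are hypothetical, threads are chosen round-robin (the text does not specify the order), restoring the incoming thread's PC and PSR is assumed, and the pipeline flush appears only as part of the 11-cycle cost.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Saved state of one thread; its registers stay in its own window."""
    pc: int = 0
    psr: int = 0


@dataclass
class BlockMultithreadedCPU:
    contexts: list       # one Context per register window
    fp: int = 0          # index of the active register window
    pc: int = 0
    psr: int = 0
    cycles: int = 0
    SWITCH_COST = 11     # cycles per switch, the figure quoted in the text

    def context_switch(self):
        # Save the PC and PSR of the running thread.
        self.contexts[self.fp].pc = self.pc
        self.contexts[self.fp].psr = self.psr
        # Point the frame pointer at the next window (assumed round-robin).
        self.fp = (self.fp + 1) % len(self.contexts)
        # Resume the thread that owns the new window.
        self.pc = self.contexts[self.fp].pc
        self.psr = self.contexts[self.fp].psr
        self.cycles += self.SWITCH_COST


cpu = BlockMultithreadedCPU(
    contexts=[Context(pc=100), Context(pc=200),
              Context(pc=300), Context(pc=400)],
    pc=100,
)
cpu.pc = 104             # thread 0 runs for a while, then misses remotely
cpu.context_switch()
print(cpu.fp, cpu.pc, cpu.cycles)
```

No register contents are copied during the switch, which is why a window per thread keeps the cost to a handful of cycles.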

5.2  Simulation  Results  and  Analysis 

We  compare  the  behavior  of  a  multithreaded  architecture  to  a  standard  con¬ 
figuration,  and  analyze  how  synchronization,  local  memory  access  latency,  and 
remote  memory  access  latency  contribute  to  the  run  time  of  each  application. 
See  [3]  for  additional  analyses. 

A  thorough  evaluation  of  multithreading  will  require  a  large  parallel  machine 
and  a  scheduler  optimized  for  multithreaded  multiprocessors.  On  the  largest 
machines  we  can  reasonably  simulate  (around  64  processors)  and  with  our  cur¬ 
rent  scheduler,  the  scheduling  cost  of  threads  generally  outweighs  the  benefits 
of  latency  tolerance.  Furthermore,  the  locality  enhancement  afforded  by  our 
caches  and  the  runtime  system  diminishes  the  effect  of  non-local  communica¬ 
tions.  Indeed,  multithreading  is  expected  to  be  the  last  line  of  defense  when 
locality enhancement has failed. However, it is still possible to observe the
benefits  of  multithreading  for  phases  of  applications  with  poor  communication 
locality. 

Our  simulation  results  are  derived  from  both  post-mortem  scheduled  and  full 
system  simulation  branches  of  ASIM.  The  post-mortem  scheduled  runs  use 
traces  of  SIMPLE  and  Weather  as  described  in  Section  4.4  and  the  full  system 
simulations  represent  a  transpose  phase  for  a  256  x  256  matrix.  In  addition  to 
determining  the  execution  time  of  an  application,  the  multiprocessor  simulator 
generates raw statistics that measure an application’s memory access patterns
and  the  utilization  of  various  system  resources.  We  will  use  these  statistics  to 
explain  the  performance  of  our  multithreaded  architecture.  The  simulations 
reported  in  the  following  sections  use  the  parameters  listed  in  Table  3. 

5.3  Effect  of  Multithreading 

Table  4  shows  the  run  times  for  the  various  applications  using  one  and  two 
threads  per  processor.  SIMPLE  and  Weather  realize  about  a  20%  performance 
increase from multithreading. Since neither of the application problem sets is
large enough to sustain more than 128 contexts, no performance gain results
from increasing the number of contexts from two to three per processor. For
the matrix transpose phase, we realize a performance gain of about 20% with
two threads and 25% with four threads.

Number of Processing Elements    64
Cache Coherence Protocol         LimitLESS4
Cache Size                       64KB (4096 lines)
Cache Block Size                 16 bytes
Network Topology                 2 Dimensional Mesh (8 x 8)
Network Channel Width            16
Network Speed                    2 x processor speed
Memory Latency                   5 processor cycles
Context Switch Time              11 processor cycles

Table  3:  Default  Simulation  Parameters 


Application    Contexts    Time
SIMPLE         1           2440123
               2           2034963
Weather        1           1405536
               2           1150325
Transpose      1           172242
               2           141571
               4           129450

Table  4:  Effect  of  Multithreading 
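
The gains quoted in the text can be recomputed from the run times in Table 4. The short script below is a convenience sketch; the variable and function names are hypothetical. It reports each result two ways, because a percentage gain can be read either as an increase in speed or as a reduction in run time.

```python
# Run times from Table 4, keyed by application and contexts per processor.
table4 = {
    "SIMPLE": {1: 2440123, 2: 2034963},
    "Weather": {1: 1405536, 2: 1150325},
    "Transpose": {1: 172242, 2: 141571, 4: 129450},
}


def gains(times):
    """Map contexts -> (speed increase, run-time reduction) vs. one context."""
    base = times[1]
    return {n: (base / t - 1.0, 1.0 - t / base)
            for n, t in times.items() if n != 1}


for app, times in table4.items():
    for n, (faster, saved) in gains(times).items():
        print(f"{app:9s} {n} contexts: {faster:6.1%} faster, "
              f"{saved:6.1%} less time")
```

Measured as a speed increase, two contexts give roughly 20% for SIMPLE and 22% for Weather and Transpose. The 25% figure for Transpose with four contexts corresponds to the reduction in run time, which is about 33% when expressed as a speed increase.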


5.4  Cost  Analysis 

An  analysis  of  the  costs  of  memory  transactions  confirms  the  intuition  that  a 
multithreaded  architecture  yields  better  performance  by  reducing  the  effect  of 
interprocessor  communication  latency.  We  refine  the  simulator  statistics  into 
the  costs  of  four  basic  types  of  transactions. 

1.  Application  transactions  are  the  memory  requests  issued  by  the  program 
running  on  the  system.  These  transactions  are  the  memory  operations  in 
the  original  unscheduled  trace. 

2.  Synchronization  transactions  are  memory  requests  that  implement  the 
barrier  executed  at  the  end  of  a  parallel  segment  of  the  application. 

3.  Local  cache  miss  transactions  occur  when  an  application  or  synchroniza¬ 
tion  transaction  misses  in  the  cache,  but  can  be  serviced  in  the  local 
memory  module. 

4.  Remote  transactions  occur  when  an  application  or  synchronization  trans¬ 
action  misses  in  the  cache  or  requires  a  coherence  action,  resulting  in 
a  network  transmission  to  a  remote  memory  module.  Multithreading  is 
designed  to  alleviate  the  latency  caused  by  this  type  of  transaction. 


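                     SIMPLE                   Weather
Transaction Type     1 Thread    2 Threads    1 Thread    2 Threads
Application          1.00        1.00         1.00        1.00
Synchronization      1.17, 1.08 (remaining entries and column positions illegible)
Local Cache Miss     0.36 (remaining entries and column positions illegible)
Remote               3.98        2.83         1.25        [illegible]
Total                6.56        5.27         3.35        2.75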

Table  5:  Memory  access  costs,  normalized  to  application  transactions. 


The  contribution  of  each  type  of  transaction  to  the  time  needed  to  run  an 
application  is  equal  to  the  number  of  transactions  multiplied  by  the  average 
latency of the transaction. We assume that the latency of application and
synchronization  transactions  is  equal  to  1  cycle,  while  the  simulator  collects 
statistics  that  determine  the  average  latency  of  the  cache  miss  transactions. 
Table  5  shows  the  cost  of  each  transaction  type,  normalized  to  the  number 
of  application  transactions  for  SIMPLE  and  Weather.  For  example,  in  the 
simulation  of  SIMPLE  with  one  context  per  processor,  the  memory  system 
spends  an  average  of  3.98  cycles  servicing  remote  transactions  for  every  cycle 
it  spends  servicing  an  application  data  access. 
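
The normalization described above amounts to a single ratio. The sketch below shows the arithmetic; the function name and the transaction counts and latency in the example are hypothetical, chosen only so that the result reproduces the 3.98 entry quoted for SIMPLE. The paper does not report the underlying raw counts.

```python
def normalized_cost(count, avg_latency, app_count, app_latency=1.0):
    """Cost of one transaction type relative to application transactions.

    Cost is the number of transactions times their average latency;
    application transactions are assumed to take one cycle each.
    """
    if app_count <= 0:
        raise ValueError("app_count must be positive")
    return (count * avg_latency) / (app_count * app_latency)


# Hypothetical raw statistics (not taken from the paper):
app_count = 1_000_000      # application transactions
remote_count = 79_600      # remote transactions
remote_latency = 50.0      # average cycles per remote transaction

print(normalized_cost(remote_count, remote_latency, app_count))
```

A small number of remote transactions can therefore dominate the total cost when their average latency is tens of cycles.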

The statistics in Table 5 are calculated directly from the raw statistics generated
by the multiprocessor simulator, except for the cost of remote transactions in
the multithreaded environment. A multithreaded architecture can overlap some
of  the  cycles  spent  servicing  remote  transactions  with  useful  work  performed 
by  switching  to  an  active  thread.  We  approximate  the  number  of  cycles  that 
are  overlapped  from  the  average  remote  transaction  latency,  the  context  switch 
overhead,  and  the  number  of  remote  transactions.  The  number  of  overlapped 
cycles  is  subtracted  from  the  latency  of  remote  transactions  in  order  to  adjust 
the  cost  of  remote  transactions.  For  all  of  the  simulations  summarized  in  the 
table,  the  total  cost  multiplied  by  the  number  of  application  cycles  is  within 
5% of the actual number of cycles needed to execute the application.

The  analysis  shows  that  remote  transactions  contribute  a  large  percentage  of 
the  cost  of  running  an  application.  This  conclusion  agrees  with  the  premise 
that  communication  between  processors  significantly  affects  the  speed  of  a  mul¬ 
tiprocessor.  The  multithreaded  architecture  realizes  higher  speed-up  than  the 
standard  configuration,  because  it  reduces  the  cost  of  remote  transactions.  Be¬ 
cause  communication  latency  grows  with  the  number  of  processors  in  a  system, 
the  relative  cost  of  remote  transactions  increases.  This  trend  indicates  that  the 
effect  of  multithreading  becomes  more  significant  as  system  size  increases. 
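
The share of the total cost attributable to remote transactions follows directly from the Remote and Total rows of Table 5. The sketch below uses only the legible entries of those two rows; the names are hypothetical.

```python
# Remote and Total rows of Table 5, keyed by (application, threads).
remote = {("SIMPLE", 1): 3.98, ("SIMPLE", 2): 2.83, ("Weather", 1): 1.25}
total = {("SIMPLE", 1): 6.56, ("SIMPLE", 2): 5.27, ("Weather", 1): 3.35}


def remote_share(key):
    """Fraction of the total normalized cost spent on remote transactions."""
    return remote[key] / total[key]


def remote_part_of_saving(app):
    """Fraction of the one-to-two-thread saving that comes from remote cost."""
    saved_remote = remote[(app, 1)] - remote[(app, 2)]
    saved_total = total[(app, 1)] - total[(app, 2)]
    return saved_remote / saved_total


for key in remote:
    print(key, f"{remote_share(key):.1%}")
print("SIMPLE saving due to remote:", f"{remote_part_of_saving('SIMPLE'):.1%}")
```

For SIMPLE, remote transactions account for about 61% of the cost with one thread and about 54% with two, and roughly nine tenths of the total saving comes from the reduced remote cost.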


6  Related  Work


A  hardware  approach  to  the  automatic  reduction  of  non-local  references  that  has 
achieved  wide  success  in  small-scale  shared-memory  systems  is  the  use  of  high¬ 
speed  caches  to  hold  local  copies  of  data  needed  by  the  processor.  The  memory 
consistency problem can be solved effectively on bus-based machines [15, 38] by
exploiting their broadcast capabilities, but buses are bandwidth limited. Hence
most shared-memory machines that deal with more than 8 or 16 processors do
not support caching of shared data [17, 14, 32, 24].

Some  recent  efforts  propose  to  circumvent  the  bandwidth  limitation  through 
various  arrangements  of  buses  and  networks  [41,  16,  26,  10].  However,  buses 
cannot  keep  pace  with  improving  processor  technologies,  because  they  suffer 
from  clocking  speed  limitations  in  multidrop  transmission  environments.  The 
DASH  architecture  does  not  really  require  the  bus  broadcast  capability;  rather, 
it  uses  a  full-map  directory  scheme  to  maintain  cache  consistency.  In  contrast, 
Alewife  is  exploring  the  use  of  the  LimitLESS  directory  for  cache  coherence, 
where the directory memory requirements grow as Θ(N log N) with machine
size.
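
The difference in growth rates can be checked with a short calculation. The sketch below counts only sharer-tracking bits and ignores state bits; it assumes four pointers per entry and a fixed number of memory blocks per node, and the function names and the block count are hypothetical.

```python
def full_map_entry_bits(nodes):
    """A full-map entry keeps one presence bit per node."""
    return nodes


def limited_entry_bits(nodes, pointers=4):
    """A limited entry keeps a fixed number of node-id pointers."""
    pointer_width = max(1, (nodes - 1).bit_length())   # ceil(log2(nodes))
    return pointers * pointer_width


def machine_directory_bits(nodes, entry_bits, blocks_per_node=1024):
    """Total directory bits when memory grows linearly with machine size."""
    return nodes * blocks_per_node * entry_bits


for n in (64, 1024):
    print(n,
          machine_directory_bits(n, full_map_entry_bits(n)),
          machine_directory_bits(n, limited_entry_bits(n)))
```

At 64 nodes a four-pointer entry needs 24 bits against 64 for full-map; at 1024 nodes the figures are 40 and 1024, so the per-entry gap widens as the machine grows.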

Chained  directory  protocols  [20]  are  scalable  in  terms  of  their  memory  require¬ 
ments,  but  they  suffer  from  high  invalidation  latencies,  because  invalidations 
must be transmitted serially down the links. It is possible to mask the latency by
using a block multithreaded processor such as Sparcle, or by implementing
some form of combining. Accordingly, we have observed that the chaining scheme
enjoys a larger relative benefit from multithreading than the LimitLESS scheme.
Chained  protocols  also  require  additional  traffic  to  prevent  fragmentation  of  the 
linked  lists  when  cache  locations  are  replaced.  Furthermore,  chained  directory 
protocols  lack  the  LimitLESS  protocol’s  ability  to  couple  closely  with  a  multi¬ 
processor’s  software. 

Although  caches  are  successful  in  automatic  locality  management  in  many  en¬ 
vironments,  they  are  not  a  panacea.  Caches  rely  on  a  very  simple  heuristic  to 
improve  communication  locality.  On  a  memory  request,  caches  retain  a  local 
copy  of  the  datum  in  the  hope  that  the  processor  will  reuse  it  before  some  other 
processor  attempts  to  write  to  the  same  location.  Thus  repeat  requests  are  sat¬ 
isfied  entirely  within  the  node,  and  communication  locality  is  enhanced  because 
remote  requests  are  avoided.  Caching  and  the  associated  coherence  algorithms 
can  be  viewed  as  a  mechanism  for  replicating  and  migrating  data  objects  close 
to  where  they  are  used.  Unfortunately,  the  same  locality  management  heuristic 
is  ill-suited  to  programs  with  poor  data  reuse;  attempts  by  the  programmer 
or  compiler  to  maximize  the  potential  reuse  of  data  will  not  benefit  all  ap¬ 
plications.  In  such  environments,  the  ability  to  enhance  the  communication 
locality  of  references  that  miss  in  the  cache  and  the  ability  to  tolerate  latencies 
of  non-local  accesses  are  prerequisites  for  achieving  scalability. 

The Alewife effort is unique in its multilayered approach to locality management;
the compiler, runtime system and caches share the responsibility of intelligent
partitioning and placement of data and processes to maximize communication
locality. The block multithreaded processors mitigate the effects of
unavoidable remote communication with their ability to tolerate latency.


7  Perspective  and  Summary 

The  class  of  MIMD  machines  is  composed  mainly  of  shared  memory  multi¬ 
processors  and  message  passing  multicomputers.  In  the  past,  machine  realiza¬ 
tions  of  shared  memory  multiprocessors  corresponded  closely  with  the  shared- 
memory  programming  model.  Although  the  network  took  many  forms,  such 
as  buses  and  multistage  networks,  shared  memory  was  uniformly  accessible  by 
all  the  processors,  closely  reflecting  the  programmer’s  viewpoint.  It  was  rela¬ 
tively  easy  to  write  parallel  programs  for  such  machines  because  the  uniform 
implementation  of  shared  memory  did  not  require  careful  placement  of  data  and 
processes. However, it has become abundantly clear that such architectures do
not scale to more than a few tens of processors, because an efficient implementation
of uniform memory access is infeasible due to physical constraints.

Message  passing  machines,  on  the  other  hand,  were  built  to  closely  match  phys¬ 
ical  constraints,  and  message  passing  was  the  computational  model  of  choice. 
In this model, no attempt was made to provide uniform access to all of memory;
rather, access was limited to local memory. Communication with remote nodes
required  the  explicit  use  of  messages.  Because  they  allowed  the  exploitation  of 
locality,  the  performance  of  such  architectures  scaled  with  the  size  of  the  ma¬ 
chine  for  applications  that  displayed  communication  locality.  Unfortunately, 
the  onus  of  managing  locality  was  relegated  to  the  user.  The  programmer  not 
only  had  to  worry  about  partitioning  and  placing  data  and  processes  to  mini¬ 
mize  expensive  message  transmissions,  but  also  had  to  overcome  the  limitations 
of  the  small  amount  of  memory  within  a  node. 

Recent  designs  reflect  an  increased  awareness  of  the  importance  of  simultane¬ 
ously  exploiting  locality  and  reducing  programming  difficulty.  Accordingly,  we 
see  a  confluence  in  MIMD  machine  architectures  with  the  emergence  of  dis¬ 
tributed  shared-memory  architectures  that  allow  the  exploitation  of  communi¬ 
cation  locality,  and  message  passing  architectures  with  global  addressability.  A 
major  challenge  in  such  designs  is  the  management  of  communication  locality. 

Alewife is a distributed shared-memory architecture that allows the exploitation
of locality through the use of mesh networks. Alewife’s network interface
is message oriented, while the processor interface with the rest of the system is
memory reference oriented. Alewife’s approach to locality management is multilayered,
encompassing the compiler, the runtime system, and the hardware.

While a more general compiler system is being developed, we have been experimenting
with applications with special structure. Prasanna [33] has developed
a compiler for expressions of matrix operations and FFTs. The system exploits
the known structure of such computations to derive near-optimal process partitions
and schedules. The Matexpr program used in Section 4 was produced
by this system. The speedups measured on ASIM with this system outstrip
the performance of parallel programs written using traditional heuristics.
A  runtime  system  for  Alewife  is  operational.  The  system  implements  dynamic 
process partitioning and near-neighbor tree scheduling. The tree scheduler currently
uses the simple heuristic that threads closely related through their control
flow are highly likely to communicate with each other. For many applications
written in a functional style with the use of futures for synchronization, the
assumption is largely true, and the performance is superior to that of a randomized
scheduler.

Caches  are  useful  in  enhancing  locality  for  applications  where  there  is  a  signifi¬ 
cant  amount  of  reuse  (assuming  locality  is  related  to  the  frequency  and  distance 
of  remote  communications).  The  LimitLESS  directory  scheme  solves  the  cache 
coherence  problem  in  Alewife.  This  scheme  is  scalable  in  terms  of  its  directory 
memory  use,  and  its  performance  is  close  to  that  of  a  full-map  directory  scheme. 

The  performance  gap  between  LimitLESS  and  full-map  is  expected  to  become 
even smaller as the machine scales in size. Although in a 64-node machine, the
software handling cost of LimitLESS traps is of the same order as the remote
transaction latency of hardware-handled requests (about 50 cycles), the internode
communication latency in much larger systems will be much more significant
than the processors’ interrupt handling latency. Furthermore, improving processor
technology will make the software handling cost even less significant. If
both  processor  speeds  and  multiprocessor  sizes  increase,  handling  cache  coher¬ 
ence  completely  in  software  will  become  a  viable  option.  Indeed,  the  LimitLESS 
protocol  is  the  first  step  on  the  migration  path  towards  interrupt-driven  cache 
coherence. 

Latency  tolerance  through  the  use  of  block  multithreaded  processors  is  Alewife’s 
last line of defense when the other layers of the system are unable to minimize
the latency of memory requests. The multithreaded scheme allows us to mask
both memory and synchronization delays. The hardware support needed for
block multithreading also makes trap handling efficient.

The design of Alewife is in progress and a detailed simulator called ASIM is operational.
The Sparcle processor has been designed; its implementation through
modifications to an existing LSI Logic SPARC processor is in progress. A significant
portion of the software system, including the dynamic partitioning scheme
and the tree scheduler, is implemented and runs on ASIM. The Alewife compiler
currently accepts hand partitioning and placement of data and threads;
ongoing work focuses on automating the partitioning and placement. Several
applications have been written, compiled, and executed on our simulation system.